Empirical analysis of collective human behavior for extraordinary events in blogosphere
To uncover the underlying mechanisms of collective human dynamics, we survey more than 1.8 billion
blog entries and observe the statistical properties of word appearances. We focus on words that show
dynamic growth and decay with a tendency to diverge on a certain day. After careful pretreatment
and fitting, we find that power laws generally approximate the functional forms of growth and
decay, with exponents around -1. We also observe news words whose frequencies increase suddenly
and then decay following power laws. To explain these dynamics, we propose a simple model of
posting blogs involving a keyword, and we check its validity directly against the data. The model
suggests that bloggers not only respond to the latest number of blogs but also feel
psychological pressure from the divergence day. Our empirical results can be used for predicting the
number of blogs in advance and for estimating the period needed to return to the normal fluctuation level.
1. INTRODUCTION



Collective motion is recognized as a typical example of self-organization caused by
interactions between neighbors, and it has therefore been widely investigated by the
physics community [1, 2]. In particular, collective behavior in human society has
attracted considerable interest in the last decade, because developments in information
technology have enabled the storage of large volumes of high-frequency human activity
data. Examples include detecting bubbles in stock exchange activities [3], modeling
dealer behavior using real data from the foreign exchange markets [4], and the empirical
analysis of consumer behavior in supermarkets and convenience stores using purchase
history and point of sales (POS) data [5, 6]. Human activity data collected from the
web, for example from YouTube videos and the social network service Facebook, are
analyzed not only to explain basic individual human behavior but also to elucidate
hidden network structures in society [7, 8].

Here we also use data from the web to uncover non-trivial mechanisms of collective
human activities. Because word frequency on the web is expected to immediately reflect
the real social mood, it has attracted increasing attention among many academic and
industrial researchers. In fact, such data are stored electronically and analyzed
widely. For example, the Library of Congress in the United States, the largest library
in the world, has been archiving all public tweets from Twitter, a micro-blogging
system, since 2007 (http://blog.twitter.com/2010/04/tweet-preservation.html).



* E-mail: sano.yukie@nihon-u.ac.jp



A blog is a type of website that is maintained by an in- 
dividual with entries displayed chronologically with time 
stamps. The term "blog" originated from the combina- 
tion of "web" and "log," and was popularized around the 
year 2000 when free blog services began to be provided 
by internet service companies. A "blogger," who is an 
owner of a blog site, can easily upload his/her "entries" 
any time, and readers can easily post comments on the 
blog page. This interactive quality has contributed to the 
success of blogs; they are now widely used as basic social 
communication tools. The collective community of blogs 
is often called the "blogosphere." There exists a relation 
between the blogosphere and various scientific fields such 
as statistical physics, engineering, sociology, linguistics, 
and psychology. 

In this study, we analyze the keyword appearance rate in blogs, for which the
functional forms of growth and decay around the peak are approximated by power laws.
In earthquake research, the frequency of aftershocks is reported to decrease following
a power law of the time elapsed after the main shock; this is known as Omori's Law [9].
Similar power laws have been established in other fields of human activity. For
example, a power law can describe the decrease in online book sales, with an exponent
that depends on whether the shock is endogenous or exogenous [13]. Relaxation in
audience numbers for online movies can also be described by power laws with various
values of exponents that reflect the quality of the content [3]. Alfi et al. found that
growth in conference registration numbers is also approximated by a power law
diverging at the deadline [11].

In Sec. 2 we describe the analyzed data and Japanese blogs. In Sec. 3 we introduce our
pretreatment procedures and peaked words. In Sec. 4 we focus on the time evolution of
these peaked words and show that they grow and decay following power laws. To
reproduce the power laws, we introduce a simple model of posting blogs in Sec. 5. In
Sec. 6 we discuss the predictability of our model from the standpoint of application,
and the final section is devoted to conclusions.

FIG. 1: (Color online) Temporal change of the word frequency of "April fool" per week.
The results are from Google Trends, which covers the whole world, and from our blog
data source "Kuchikomi@kakaricho," which covers only Japan. The number of blogs is
normalized by the total number.



2. DATA DESCRIPTION 

The data analyzed in this study were obtained from the Japanese-language blogosphere
over a period of four years, from November 1st 2006 to October 31st 2010. According to
a technical report by the internet search engine company Technorati
(http://technorati.com), which tracked more than 70 million blogs worldwide in 2007,
the share of Japanese blogs is 37%, the largest among all languages. Although we only
analyze the Japanese blogosphere, we show an example in which the dynamic properties
in Japanese and English are considerably similar. Figure 1 shows the temporal change
of the frequency of the English phrase "April Fool" observed worldwide by Google
Trends (http://www.google.com/trends), compared with the number of blog entries
containing the corresponding Japanese phrase. In both cases, we confirm that there is
a clear peak in the week that includes April Fools' Day.

In blogosphere research, it is important to note the existence of spam blogs. These
are automatically generated blogs in which the same words are repeated multiple
times, mainly for the purpose of advertising. As the share of spam in the Japanese
blogosphere is said to be 40%, it is important to exclude spam from the data. We used
a new internet service called "Kuchikomi@kakaricho" (http://kakaricho.jp) to collect
the data. This service provides an application programming interface (API) that counts
the number of entries in which a given target word appeared in a given period by using
search engine technology with a spam filter. There are three levels of spam filtering,
and we apply the middle level, which is known to remove most of the spam while keeping
most of the human-written blogs untouched. The API counts entries such that if one
entry includes the target word multiple times, it is counted only once.

The API started crawling the blogosphere on Novem- 
ber 1st 2006 and covered major blog service providers. 
It covers more than 1.8 billion blog entries in 15 million 
blogs accounting for 90% of the Japanese blogosphere. 

For the analysis of Japanese text, we introduced a pretreatment to separate Japanese
words, which are not separated by spaces. Here we use the commonly used Japanese
morphological analyzer "MeCab" (http://mecab.sourceforge.net/) to separate individual
words according to a dictionary. By adding words to its dictionary, this software can
treat multi-word phrases such as "April Fool" as one word, "April-Fool". Most of the
words used in this study are already listed in the software's dictionary as one word,
except for names of people.



3. PEAKED WORDS 

In the blogosphere, there are special words whose fre- 
quency grows or decays around a peak day such as "April 
Fool" with the peak on April 1st. In the following dis- 
cussion, we denote these words as "peaked words" and 
analyze their functional forms of growth and decay. 



3.1. Pretreatment 

We first apply the following pretreatment to the data 
to exclude both trivial circadian human activity patterns 
and systematic noises. 

Time-Shift In the blogosphere, although a day starts at 00:00:00, there are many
bloggers who are active at midnight. Therefore, we examine the complete circadian
activity pattern and introduce a type of correction pretreatment for our daily data.
For this purpose, we randomly chose the data of 10,000 bloggers with the details of
their activities time stamped in seconds. By counting the number of entries posted in
every hour, a circadian activity pattern is plotted in Fig. 2. The solid line shows the
24-hour activity pattern obtained directly from the data. However, we discovered that
a certain number of blog entries have time stamps of exactly 00:00:00. We consider
this time stamp to be an artifact of a system specification or error, and we exclude
these entries from the statistics when capturing the circadian pattern. The red bars
in Fig. 2 show the revised circadian activity pattern. Using a 24-hour clock, we find
that blogging activity is lowest around 4:00, and thus we consider starting a day at
5:00 to be reasonable. Because the share of activity in the interval between 0:00 and
5:00 is approximately 10% of the complete activity of a day, we can correct the daily
number of blog entries including the j-th target word on the t-th day, $x_j(t)$, by the
following equation

\[ x'_j(t) = w\,x_j(t) + (1 - w)\,x_j(t + 1), \tag{1} \]

where the weight is set as $w = 0.9$. With this modification, we obtain the
time-shifted time series. In Fig. 3, open circles show the original daily data, for
which $w = 1.0$ in Eq. (1), and colored circles show the time-shifted data, for which
$w = 0.9$. The time-shifted data show a more symmetric pattern than the original data.
We also apply this procedure to the time series of the total number of blog entries
per day, yielding $x'(t)$. To clarify the effect of this time-shift procedure, we also
show results without it in Appendix A.

FIG. 2: (Color online) Probability density distribution of the circadian activity of
blog posting by 10000 bloggers (horizontal axis: hours). The solid line is calculated
from all entries, and the red bars are calculated from the entries excluding those
with a time stamp of 00:00:00. In both cases, the ratio is smallest at 4:00.

Normalization The total number of entries per day has non-uniform and non-stationary
properties [12]. For example, there was a sudden drop in February 2007 that was caused
by system maintenance of the search engine software. In order to reduce the systematic
fluctuations caused by such non-uniform properties, we apply the following
normalization procedure. There is already a method to separate internal and external
noises [13], which simply deducts the external factor in proportion to each
component's share of the total traffic. It assumes that the traffic $x'_j(t)$ of each
small component $j = 1, \dots, N$ makes up the total traffic $x'(t)$ without overlap,
so that $\sum_{j=1}^{N} x'_j(t) = x'(t)$. Here, we simply divide $x'_j(t)$ by the total
number $x'(t)$. The normalized number of entries for the j-th word on the t-th day is
defined by $x_j(t) = x'_j(t)\,\langle x' \rangle / x'(t)$, where $\langle x' \rangle$
denotes the mean value of $x'(t)$ averaged over the entire observation period. This
normalized quantity is proportional to the probability that a blog contains the j-th
word on the t-th day, and it is not necessarily an integer.

By introducing this normalization, the fluctuations caused by the aforementioned
non-uniform properties can be reduced. In this study, we measure the word frequency
using this normalization procedure, and $x_j(t)$ below refers to this normalized
quantity.
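The two pretreatment steps can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `time_shift` and `normalize` and the toy counts are invented here, and the last day of each series is simply dropped because Eq. (1) needs the following day.

```python
from statistics import fmean


def time_shift(raw, w=0.9):
    """Eq. (1): x'(t) = w*x(t) + (1-w)*x(t+1).

    The last day has no successor, so the output is one element shorter.
    """
    return [w * raw[t] + (1.0 - w) * raw[t + 1] for t in range(len(raw) - 1)]


def normalize(word_shifted, total_shifted):
    """x_j(t) = x'_j(t) * <x'> / x'(t), with <x'> the period mean of the total."""
    mean_total = fmean(total_shifted)
    return [xj * mean_total / xt for xj, xt in zip(word_shifted, total_shifted)]


# Toy daily counts (invented for illustration, not taken from the blog data).
word_raw = [10, 12, 40, 200, 90, 30, 15]
total_raw = [1000, 1100, 900, 1000, 500, 1000, 1050]  # day 5 mimics an outage

word_norm = normalize(time_shift(word_raw), time_shift(total_raw))
print([round(v, 1) for v in word_norm])
```

With `w=1.0` the time shift leaves the series unchanged apart from the dropped last day, which corresponds to the open circles in Fig. 3.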



FIG. 3: (Color online) Typical example of the time series of the peaked word "Marine
Day" in 2008. w = 1 corresponds to no revision, and w = 0.9 corresponds to the modified
time series introduced in Sec. 3.1. Because of the circadian effect, without
modification the data for the day after the peak are always higher than those for the
day before the peak.



3.2. Word Selection 

We determine candidates for peaked words in the fol- 
lowing three categories. 

Event We selected the names of 14 public holidays and 16 major annual events in Japan.
The appearance of these words grows and decays around the date of the event. In
addition, there are words affiliated with an event, such as "Santa Claus" for
"Christmas," and we can observe similar growth and decay behavior for those words.
However, in this analysis we neglected such affiliated words.

Date We selected dates such as "9th May," resulting in 
365 words. There are many blog entries that an- 
nounce some special day, e.g., birthday and festi- 
val. Growth and decay of these words always show 
a clear peak at the date. 

News A word such as "earthquake" occurs suddenly 
right after the occurrence of the event and the word 
appearance rate generally decays slowly. In order 
to observe the functional form of such decay af- 
ter a significant event, we selected names of the 
places impacted by earthquakes. We also selected 
33 names of famous people who died suddenly. In 
addition, we included the names of the Japanese 
scientists who received a Nobel Prize during our 
observation period. 



4. DYNAMICS OF PEAKED WORDS 

We call the slopes before the peak day fore-slopes and those after the peak day
after-slopes, and we examine both. As the study of power law statistics is not yet
mature, there is no standard method to check the validity of a power law approximation
for given data [14]. A promising statistical test for power law distributions is based
on Kolmogorov-Smirnov statistics [15]; here, we generalize the method to time series.



4.1. Method 

We define the number of days in each slope as the number of consecutive days, counted
from the peak, on which the word frequency is larger than the median value of
$x_j(t)$. The median value is estimated over the entire observation period. Then we
approximate the functional form of the slopes using two models, a power law and an
exponential law:

\[ x_j(t) = A_j\,|t_c - t|^{-\alpha_j}, \tag{2} \]

\[ x_j(t) = B_j \exp\left(-\beta_j\,|t_c - t|\right). \tag{3} \]

The parameters of these models, $\alpha_j$, $A_j$, $\beta_j$, and $B_j$, are determined
by the least squares method. The fitting region is $[t_c \pm 1, t_c \pm n]$, where $n$
is the number of days in the slope. Then we apply the Kolmogorov-Smirnov (KS)
goodness-of-fit test to choose the better model. It was originally used as a
statistical test for distributions; here, we apply it to evaluate how well a
functional form fits the time series. For both models we calculate the KS statistic
$D$, which represents the deviation and is defined as

\[ D = \max_{t \in [t_c \pm 1,\, t_c \pm n]} \left| X_j^{(\mathrm{empirical})}(t) - X_j^{(\mathrm{model})}(t) \right|, \tag{4} \]

where $X_j^{(\mathrm{empirical})}(t)$ is the cumulative number counted from the data,
and $X_j^{(\mathrm{model})}(t)$ is the cumulative number calculated from the model. In
both cases, the numbers are normalized by $X_j(t_c \pm 1)$. By comparing the values of
$D$ for both models, the power law model is accepted if its $D$-value is smaller. In
the case that the power law is accepted, we check the validity of the model as
introduced in [15]. First, we generate 1000 synthetic data sets, each containing $n$
data points. Synthetic data points are generated randomly following a normal
distribution whose mean is the best estimate from the model,
$x_j^{(\mathrm{model})}(t)$, and whose standard deviation is
$\sigma(x_j^{(\mathrm{model})}(t))$, given by

\[ \sigma(\langle x_j \rangle) \simeq \sqrt{\langle x_j \rangle \left(1 + a^2 \langle x_j \rangle\right)}, \tag{5} \]

where $a = \sqrt{\langle X^2 \rangle_c} / \langle X \rangle = 0.08$ is a constant
parameter characterizing the fluctuation in the number of all bloggers, which is
determined independently of the word (see Appendix B for the theoretical derivation of
Eq. (5)). For each synthetic time series, we compare its $D$-value with that of the
empirical one, and we count the number of cases in which the $D$-value for the
synthetic time series is larger. The fraction of such cases defines $q$; for example,
100 such cases out of the 1000 synthetic samples correspond to $q = 0.1$. Contrary to
the ordinary sense of a p-value, the power law hypothesis is considered to be valid
for larger $q$. Thus, if $q$ is close to 1, the difference between the empirical data
and the model can be attributed to statistical fluctuation alone, and we accept the
power law hypothesis. If $q$ is smaller than 0.1, that is, if fewer than 100 of the
1000 synthetic samples have a larger $D$-value, we reject the power law hypothesis. We
vary the border of the fitting region $n$ from 5 days to the maximum slope length. The
value of the power exponent $\alpha_j$ is given by the value for the case with the
largest $n$ for which the power law model is accepted.
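A minimal sketch of this model-selection procedure follows. Several details are assumptions made only for illustration, since the text does not fix them: the least squares fits are done on logarithmic axes, the cumulative sums are accumulated from the day nearest the peak and each is scaled by its own total, synthetic series are compared with the model without refitting, and negative synthetic values are clipped at zero. All function names and the toy data are invented.

```python
import math
import random


def _linear_fit(u, v):
    """Ordinary least squares for v = intercept + slope * u."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    slope = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / sum((a - mu) ** 2 for a in u)
    return mv - slope * mu, slope


def fit_power_law(taus, xs):
    """Fit x = A * tau**(-alpha) on log-log axes; returns (A, alpha)."""
    c, s = _linear_fit([math.log(t) for t in taus], [math.log(x) for x in xs])
    return math.exp(c), -s


def fit_exponential(taus, xs):
    """Fit x = B * exp(-beta * tau) on semi-log axes; returns (B, beta)."""
    c, s = _linear_fit(list(taus), [math.log(x) for x in xs])
    return math.exp(c), -s


def ks_distance(xs, model):
    """Largest gap between the two cumulative sums, each scaled to end at 1."""
    tx, tm, cx, cm, d = sum(xs), sum(model), 0.0, 0.0, 0.0
    for x, m in zip(xs, model):
        cx += x / tx
        cm += m / tm
        d = max(d, abs(cx - cm))
    return d


def sigma(mean, a=0.08):
    """Eq. (5): standard deviation expected for a given mean frequency."""
    return math.sqrt(mean * (1.0 + a * a * mean))


def q_value(xs, model, n_synth=1000, seed=0):
    """Fraction of synthetic series whose D exceeds the empirical D."""
    rng = random.Random(seed)
    d_emp = ks_distance(xs, model)
    larger = 0
    for _ in range(n_synth):
        synth = [max(rng.gauss(m, sigma(m)), 0.0) for m in model]
        larger += ks_distance(synth, model) > d_emp
    return larger / n_synth


# Toy after-slope: tau = t - t_c = 1..30 days, noisy power law (invented numbers).
rng = random.Random(1)
taus = list(range(1, 31))
truth = [3000.0 * t ** -1.2 for t in taus]
xs = [max(rng.gauss(m, sigma(m)), 1.0) for m in truth]

A, alpha = fit_power_law(taus, xs)
B, beta = fit_exponential(taus, xs)
power_model = [A * t ** -alpha for t in taus]
exp_model = [B * math.exp(-beta * t) for t in taus]
print("D power:", ks_distance(xs, power_model), "D exp:", ks_distance(xs, exp_model))
print("alpha:", alpha, "q:", q_value(xs, power_model))
```

In this sketch the power law would be kept when its `D` is the smaller of the two and `q` is at least 0.1, mirroring the acceptance rule described above.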



4.2. Results 

Figure 4 shows a typical result of data fitting for the word "Marine Day" in 2008 on a
log-log scale; the same data are shown on a linear scale in Fig. 3. For all cases of
power law fitting, the distributions of the estimated power exponents are shown in
Fig. 5 and summarized in Table I.

The absolute value of the power exponent of the after-slope is larger than that of the
fore-slope in 58% of the 65 samples for Event, and in 80.6% of the 603 samples for
Date. For Date, a t-test confirms a significant difference between fore-slopes and
after-slopes (p-value < $2 \times 10^{-16}$), while for Event no significant
difference is found (p-value = 0.80). The number of days in the after-slope is larger
than that in the fore-slope in 55% of the 65 samples for Event, and in 65.8% of the 603
samples for Date. For Date, a KS test confirms a significant difference between
fore-slopes and after-slopes (p-value < $2 \times 10^{-16}$), while for Event no
significant difference is found (p-value = 0.22).
FIG. 4: (Color online) Examples of data fitting by power laws for "Marine Day" in 2008
for the fore-slope (a) and the after-slope (b), plotted on a log-log scale. For the
fore-slope, the model of Eq. (2) is fitted with $\alpha_j = 1.27$ and $A_j = 3100$
($q = 0.239$, $n = 58$). For the after-slope, $\alpha_j = 1.42$ and $A_j = 3171$
($q = 0.108$, $n = 18$). The shaded area shows the interval $n$ in which the power law
model is accepted.



In the case of the news words, there is no fore-slope, and we cannot compare the
values of the exponents before and after the peak. The absolute value of the exponent
after the peak tends to be estimated as smaller for high-impact news because of the
effect of sequential broadcasts after the news. For example, in the case of the sudden
death of the world-famous entertainer Michael Jackson, which marked the peak day,
there was a funeral service a few days later and a memorial CD was released a few
weeks later. Both can be regarded as aftershocks that remind us of the main news.
Because of such repetition, the keyword appearance rate after the peak day is
enhanced, the decay of the word appearance becomes slower, and the power exponent
tends to take a smaller value.



4.3. An extreme case "Tsunami" 

The power law decay per day of the word "Tsunami" in the Japanese blogosphere is shown
in Fig. 6(a). The peak day was March 12th 2011, the day after the quake, with 142617
posts, or 12.6% of all blog posts, in the raw data. After the pretreatment of
time-shift and normalization, the estimated power exponent $\alpha_j$ is 0.67 with
$A_j = 61788$ ($n = 50$) using Eq. (2). It is expected to take approximately 8623 days
(~ 23.4 years) to return to the normal fluctuation level if we simply extrapolate the
power law function. The normal fluctuation level was 140 appearances per day,
estimated from the data one month before the quake. Although most news words decay in
approximately 10 days, the case of "Tsunami" is a rare exception because the number of
entries is still ten times higher than before the peak even a year after the quake.

TABLE I: Mean values of power exponent $\alpha_j$ with standard deviations and medians
of slope days $n$.

          Slope    $\alpha_j$     n (days)   # samples
Event     Before   1.40 ± 0.38    10         83
          After    1.44 ± 0.28    16         91
Date      Before   0.79 ± 0.38    9          776
          After    1.11 ± 0.16    21         1229
News      After    1.09 ± 0.45    10         21
All       Before   0.85 ± 0.30    9          859
          After    1.13 ± 0.21    20         1341

FIG. 5: (Color online) Distribution of power exponent $\alpha_j$ of the fore-slopes (a)
and the after-slopes (b). The mean value of $\alpha_j$ is 0.85 ± 0.30 for the
fore-slopes and 1.13 ± 0.21 for the after-slopes.
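The return-time estimate for "Tsunami" amounts to solving Eq. (2) for the day on which the fitted power law reaches the normal fluctuation level. The sketch below does this with the rounded parameters quoted in the text; the function name `days_to_return` is invented here. With these rounded inputs the result comes out near nine thousand days, the same order as the 8623 days quoted above; the remaining gap presumably reflects rounding of the fitted exponent, which is an assumption rather than something the text states.

```python
def days_to_return(A, alpha, normal_level):
    """Solve A * tau**(-alpha) = normal_level for tau (days after the peak)."""
    return (A / normal_level) ** (1.0 / alpha)


# Rounded parameters quoted for "Tsunami": A_j = 61788, alpha_j = 0.67,
# normal fluctuation level of 140 appearances per day.
tau = days_to_return(A=61788, alpha=0.67, normal_level=140)
print(f"{tau:.0f} days, about {tau / 365.25:.1f} years")
```

Because the exponent enters as $1/\alpha_j$ in the power, small changes in $\alpha_j$ shift the estimate by hundreds of days, so such extrapolations are best read as order-of-magnitude statements.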

Twitter also shows similar power law behavior even though the time resolution is
different. Figure 6(b) shows the number of tweets per hour that include "Tsunami,"
calculated from 1397783 tweets. We believe that the appearance of this type of power
law in both media reflects the robustness of the empirically observed dynamics of
collective human behavior.
FIG. 6: (Color online) Decay of "Tsunami" observed for blogs (a) and tweets (b). The
horizontal time step is one day for blogs and one hour for tweets. (Inset) Log-log
plots of the time series. Red dashed lines show the slope of the power law with the
exponent $\alpha_j = 0.67$ for blogs and $\alpha_j = 0.75$ for tweets.



5. THE MODEL 

In this section, we propose a simple dynamic model to describe the typical power law
growth and decay of the frequency of blogs containing peaked words. There is already a
simple model that describes people's universal behavior before a deadline by assuming
pressure inversely proportional to the remaining time [11]. As this simple model can
describe only the special case $\alpha = 1$, a kind of utility function that includes
the tendency to postpone the action is introduced to describe the general case. Here,
we introduce another approach to describe the general case. We use the following two
assumptions for the change in the number of blogs,
$\Delta x_j(t) = x_j(t + 1) - x_j(t)$, which is an increment for the fore-slope and a
decrement for the after-slope.

1. The pressure from the peak day $t_c$ is inversely proportional to the time
interval, $1/|t_c - t|$ [11].

2. The change $\Delta x_j(t)$ is proportional to the number of blogs $x_j(t)$.



In the continuous case, where we assume $\Delta x_j(t) \simeq dx_j(t)/dt$, these two
assumptions can be written in mathematical form. The time evolution of blogs for the
fore-slope is given as

\[ \frac{dx_j(t)}{dt} = \alpha_j^{(\mathrm{fore})}\, \frac{x_j(t)}{t_c - t} + f(t), \tag{6} \]

where $f(t)$ is an independent noise with zero mean. The value
$\alpha_j^{(\mathrm{fore})} > 0$ is a proportionality factor that describes the effect
of the above-mentioned two assumptions. Similarly, the decrement of the after-slope is
given as

\[ \frac{dx_j(t)}{dt} = -\alpha_j^{(\mathrm{after})}\, \frac{x_j(t)}{t - t_c} + f(t), \tag{7} \]

where $\alpha_j^{(\mathrm{after})} > 0$ is also a proportionality factor that
describes the effect of the two assumptions. Because we know that blogs decrease after
$t_c$, we use a negative sign in Eq. (7). One can confirm that, without the noise
term, Eq. (6) and Eq. (7) yield the solution of Eq. (2). Thus,
$x_j(t) \propto (t_c - t)^{-\alpha_j^{(\mathrm{fore})}}$ for fore-slopes and
$x_j(t) \propto (t - t_c)^{-\alpha_j^{(\mathrm{after})}}$ for after-slopes. In the case
that there is no pressure from the peak day $t_c$, blog dynamics follow the
exponential law of Eq. (3). We rewrite Eqs. (6) and (7) into the following discrete
form without the noise term $f(t)$ and check its validity against the data in Fig. 7:

\[ \frac{|\Delta x_j(t)|}{x_j(t)} = \frac{\alpha_j}{|t - t_c|}. \tag{8} \]

Note that $t_c$ is not necessarily an integer, since the divergence point is expected
to lie somewhere within a single-day period. This result suggests that bloggers not
only respond to the number of blogs but also feel pressure from the peak day $t_c$.
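As a small numerical illustration of how the discrete form of Eq. (8) relates to the continuous solution, the sketch below takes an exact power law after-slope $x(\tau) = \tau^{-\alpha}$ with $\tau = t - t_c$ and compares its day-to-day relative change with $\alpha/\tau$. The function name `relative_change` and the chosen values of $\tau$ are illustrative assumptions; the exponent 1.13 is the mean after-slope value from Table I.

```python
def relative_change(alpha, tau):
    """|x(tau + 1) - x(tau)| / x(tau) for x(tau) = tau**(-alpha)."""
    return 1.0 - (1.0 + 1.0 / tau) ** (-alpha)


alpha = 1.13  # mean after-slope exponent reported in Table I
for tau in (1, 2, 5, 10, 50):
    exact = relative_change(alpha, tau)
    approx = alpha / tau  # right-hand side of Eq. (8)
    print(f"tau={tau:3d}  daily change={exact:.4f}  alpha/tau={approx:.4f}")
```

For a pure power law the two columns approach each other as $\tau$ grows, while close to the peak the one-day difference falls below $\alpha/\tau$, so in this idealized setting Eq. (8) is best regarded as a large-$\tau$ approximation.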



FIG. 7: (Color online) Relationship between the difference rate
$|\Delta x_j(t)|/x_j(t)$ and the inverse of the time from the peak, $1/(t - t_c)$,
calculated by Eq. (8) for after-slopes, summarizing the 1341 samples shown in Table I.
Symbols show the median and the 25% and 75% percentiles. The solid line shows the
model with the empirically observed exponent $\alpha_j = 1.13$.



6. PREDICTABILITY OF FREQUENCY

As an application of this study, we explore the possibility of estimating the word
frequency in the near future. In Fig. 8 we show an example of predicting the blog
frequency of "Marine Day" in 2008. In this case, we already know that the peak day is
July 21st 2008; thus, we can fix the divergence point $t_c$. From the data, we find
that the slope period starts on April 28th, 85 days before $t_c$, as the normalized
frequency continuously exceeds the median value from this day. In Fig. 8(a), the
prediction made 20 days before the divergence point, using 65 data points with
Eq. (2), is shown by the red line. In Fig. 8(b), the prediction made 5 days before the
peak day is shown. As expected, the prediction error becomes smaller for a shorter
prediction period. Note that a small difference in the estimated exponent $\alpha_j$
makes a big difference near the peak; thus, the number of data points plays an
important role in the accuracy.

FIG. 8: (Color online) Typical examples of prediction (a) 20 days and (b) 5 days before
the peak day for "Marine Day" in 2008 on a semi-logarithmic scale. The red line
indicates the prediction, the blue dashed line indicates the median of $x_j(t)$, which
is 17, and the solid line shows the real values. Open circles indicate the future
values, and colored circles are the known values used for prediction. The estimation
starts 85 days before $t_c$. Estimated values are $\alpha_j = 1.78$ and $A_j = 20222$
for 20 days before the peak day and $\alpha_j = 1.38$ and $A_j = 4725$ for 5 days
before $t_c$.



7. CONCLUSIONS

By analyzing a large database of Japanese blogs, we 
showed that the functional forms of growth and decay of 
word appearance that peaked on a certain day are gen- 
erally approximated by power laws with the exponents 
around -1. The values of the power exponents depend on 
the category of words such as Event, Date, and News. In 
the case of Event and Date, clarification of asymmetry 
in the power exponents of the fore-slope and after-slope 
is an interesting subject for future research on collective 
human behavior. In the case of News, the power law can 
be observed only after the peak, and its power exponent 
depends on its impact. In the case of significant news 
such as the March 11th earthquake in 2011, the absolute 
value of the power exponent is clearly smaller than 1. 

We also checked the validity of a simple model in which bloggers change their
probability of posting in proportion to the number of blogs and in inverse proportion
to the time interval from the peak. The model suggests that bloggers not only respond
to the number of blogs but also spontaneously feel pressure from a peak day. In
addition, these power functions can also be observed on Twitter, which suggests that
such power law behaviors are universal in social phenomena. An agent-based
mathematical model will be used to reproduce these empirical properties of blogger
activity in the near future [16].
Appendix A: Case without time-shift

We show the results without the time-shift, that is, Eq. (1) with $w = 1$. Figure 9
shows a typical example of data fitting without the time-shift for the word "Marine
Day," as mentioned in Sec. 4.2. There is no major change in the power exponent
$\alpha_j$ for either the fore-slope or the after-slope. However, for the value of the
intercept $A_j$, we find a major deviation, especially for the fore-slope
($A_j = 3171$ with time-shift and $A_j = 2273$ without time-shift). In Fig. 10 and
Table II we summarize all samples.



FIG. 9: (Color online) Examples of data fitting by power laws without the time-shift
pretreatment for "Marine Day" in 2008 for the fore-slope (a) and the after-slope (b),
plotted on a log-log scale. For the fore-slope, the model of Eq. (2) is fitted with
$\alpha_j = 1.10$ and $A_j = 2273$ ($q = 0.103$, $n = 26$). For the after-slope,
$\alpha_j = 1.44$ and $A_j = 3369$ ($q = 0.104$, $n = 18$). The shaded area shows the
interval in which the power law model is accepted with $q \ge 0.1$.

FIG. 10: (Color online) Distribution of power exponent $\alpha_j$ of the fore-slopes
(a) and the after-slopes (b) without time-shift. The mean value of $\alpha_j$ is
0.73 ± 0.35 for the fore-slopes and 1.16 ± 0.21 for the after-slopes.

TABLE II: Mean values of power exponent $\alpha_j$ with standard deviations and
medians of slope days $n$ in the case without the time-shift procedure introduced in
Sec. 3.1.

          Slope    $\alpha_j$     n (days)   # samples
Event     Before
          After    1.48 ± 0.28
Date      Before   0.64 ± 0.25    11.5       418
          After    1.14 ± 0.18
News      After    1.21 ± 0.35    11         18
All       Before   0.73 ± 0.35               498
          After    1.16 ± 0.21    21         1279



Appendix B: Modified Random Diffusion Model

We introduce a modified random diffusion model, which is used in Eq. (5). The random
diffusion model was originally introduced to describe diffusion properties of random
walkers on a given network [17, 18], and two of the authors (Y.S. and M.T.) have
modified the model to be applicable to the fluctuations in word appearance in the
blogosphere [13]. In our modified random diffusion model, we assume that there are two
states, active and non-active, for each blogger, and that the number of active
bloggers fluctuates randomly each day. Each active blogger randomly decides to post a
blog including the j-th word. There is a key parameter in this stochastic process: the
share of the j-th word, $c_j$, is defined by the following equation

\[ c_j = \frac{\langle x_j(t) \rangle}{\langle X(t) \rangle}, \tag{B1} \]

where $x_j(t)$ is the number of blog entries including the j-th word on the t-th day,
$X(t)$ is the number of active bloggers on the t-th day, and the brackets denote the
mean over all instances. We assume that the number of active bloggers $X(t) \ge 0$
fluctuates randomly following an independent probability density distribution
$\phi(X)$ with finite moments. The probability of posting $x_j$ entries is calculated
using a Poisson distribution with mean $c_j X$ as follows

\[ P(x_j \mid c_j) = \int_0^\infty \phi(X)\, \exp(-c_j X)\, \frac{(c_j X)^{x_j}}{x_j!}\, dX. \tag{B2} \]

When $\langle x_j \rangle$ is small, the Poisson distribution is approximated by a
Bernoulli distribution that assumes $x_j = 0$ with probability $1 - c_j X$ and
$x_j = 1$ with probability $c_j X$. Thus, we have the following evaluations for an
arbitrary distribution $\phi(X)$:

\[ P(x_j = 0 \mid c_j) \simeq \int_0^\infty \phi(X)\,(1 - c_j X)\, dX = 1 - c_j \langle X \rangle, \]
\[ P(x_j = 1 \mid c_j) \simeq \int_0^\infty \phi(X)\,(c_j X)\, dX = c_j \langle X \rangle. \tag{B3} \]

For $x_j \ge 2$, $P(x_j \ge 2 \mid c_j) \approx 0$; thereby $P(x_j \mid c_j)$ is
approximated by the Poisson distribution with both the mean and the variance given by
$c_j \langle X \rangle$.

For $\langle x_j \rangle \gg 1$, the Poisson distribution can be approximated by a
normal distribution,

\[ P(x_j \mid c_j) \simeq \int_0^\infty \phi(X)\, \frac{1}{\sqrt{2 \pi c_j X}} \exp\left[ -\frac{(x_j - c_j X)^2}{2 c_j X} \right] dX. \tag{B4} \]

By introducing a new variable $y_j = x_j / c_j$, Eq. (B4) becomes

\[ P(y_j \mid c_j) \simeq \int_0^\infty \phi(X)\, \sqrt{\frac{c_j}{2 \pi X}} \exp\left[ -\frac{c_j (y_j - X)^2}{2 X} \right] dX. \tag{B5} \]

When $\langle x_j \rangle = c_j \langle X \rangle \gg 1$, the weight function in the
integral can be approximated by Dirac's delta function as

\[ P(y_j \mid c_j) \simeq \int_0^\infty \phi(X)\, \delta(y_j - X)\, dX. \tag{B6} \]

Therefore, for $\langle x_j \rangle \gg 1$ we have the following simple evaluation:

\[ P(x_j \mid c_j) \simeq \frac{1}{c_j}\, \phi\!\left( \frac{x_j}{c_j} \right). \tag{B7} \]

Calculating the first and second moments of $P(x_j \mid c_j)$, we now have the general
results

\[ \langle x_j \rangle = \int_0^\infty x_j\, P(x_j \mid c_j)\, dx_j \simeq \int_0^\infty \frac{x_j}{c_j}\, \phi\!\left( \frac{x_j}{c_j} \right) dx_j = c_j \langle X \rangle, \tag{B8} \]

\[ \langle x_j^2 \rangle = \int_0^\infty x_j^2\, P(x_j \mid c_j)\, dx_j \simeq \int_0^\infty \frac{x_j^2}{c_j}\, \phi\!\left( \frac{x_j}{c_j} \right) dx_j = c_j^2 \langle X^2 \rangle. \tag{B9} \]

From these results, the standard deviation
$\sigma_j \equiv \sqrt{\langle x_j^2 \rangle - \langle x_j \rangle^2}$ can be expressed
as

\[ \sigma_j \simeq c_j \sqrt{\langle X^2 \rangle - \langle X \rangle^2}. \tag{B10} \]

By combining both results, Eqs. (B3) and (B10), we obtain the following relation:

\[ \sigma_j^2 \simeq c_j \langle X \rangle + c_j^2 \langle X^2 \rangle_c, \tag{B11} \]

where $\langle X^2 \rangle_c$ denotes the second order cumulant. By using
$\langle x_j \rangle = c_j \langle X \rangle$, we rewrite Eq. (B11) into

\[ \sigma_j \simeq \sqrt{ \langle x_j \rangle + \frac{\langle X^2 \rangle_c}{\langle X \rangle^2}\, \langle x_j \rangle^2 }. \tag{B12} \]

Figure 11 shows empirical results using 1771 adjectives and Eq. (B12) with
$\sqrt{\langle X^2 \rangle_c} / \langle X \rangle = 0.08$.

FIG. 11: (Color online) Relationship between the mean and the standard deviation of
word frequency in the blogosphere. Empirical results for 1771 adjectives and the
theoretical result of Eq. (B12) are shown together in the figure.
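The scaling of Eq. (B12) can be checked with a short Monte Carlo sketch of the modified random diffusion model. The choices below are assumptions made only for illustration: $\phi(X)$ is taken to be a normal distribution clipped at zero with relative width $a = 0.08$, the mean number of active bloggers is set to 10000, and the names `poisson`, `simulate_word`, and `predicted_sigma` are invented here. The appendix itself allows any $\phi(X)$ with finite moments.

```python
import math
import random
from statistics import fmean, pstdev


def poisson(rng, lam):
    """Knuth's multiplication method; fine for the modest means used here."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate_word(rng, c_j, mean_X, a, days):
    """Daily counts x_j(t): X(t) fluctuates, then x_j ~ Poisson(c_j * X)."""
    counts = []
    for _ in range(days):
        X = max(rng.gauss(mean_X, a * mean_X), 0.0)  # active bloggers today
        counts.append(poisson(rng, c_j * X))
    return counts


def predicted_sigma(mean_x, a):
    """Eq. (B12) written with a = sqrt(<X^2>_c) / <X>."""
    return math.sqrt(mean_x + (a * mean_x) ** 2)


rng = random.Random(0)
mean_X, a = 10_000.0, 0.08
for target_mean in (1.0, 10.0, 100.0):
    counts = simulate_word(rng, target_mean / mean_X, mean_X, a, days=4000)
    m = fmean(counts)
    print(f"mean={m:8.2f}  std={pstdev(counts):7.2f}  "
          f"Eq.(B12)={predicted_sigma(m, a):7.2f}  Poisson only={math.sqrt(m):7.2f}")
```

For rare words the simulated standard deviation should stay close to the pure Poisson value $\sqrt{\langle x_j \rangle}$, while for frequent words the term proportional to $\langle x_j \rangle^2$ takes over, which is the crossover that Fig. 11 illustrates.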